DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

作者信息:MIT Song Han实验室

链接:Duoattention: Efficient long-context llm inference with retrieval and streaming heads

摘要

Deploying long-context large language models (LLMs) is essential but poses significant computational and memory challenges. Caching all Key and Value (KV) states across all attention heads consumes substantial memory. Existing KV cache pruning methods either damage the long-context capabilities of LLMs or offer only limited efficiency improvements. In this paper, we identify that only a fraction of attention heads, a.k.a, Retrieval Heads, are critical for processing long contexts and require full attention across all tokens. In contrast, all other heads, which primarily focus on recent tokens and attention sinks--referred to as Streaming Heads--do not require full attention. Based on this insight, we introduce DuoAttention, a framework that only applies a full KV cache to retrieval heads while using a light-weight, constant-length KV cache for streaming heads, which reduces both LLM's decoding and pre-filling memory and latency without compromising its long-context abilities. DuoAttention uses a lightweight, optimization-based algorithm with synthetic data to identify retrieval heads accurately. Our method significantly reduces long-context inference memory by up to 2.55x for MHA and 1.67x for GQA models while speeding up decoding by up to 2.18x and 1.50x and accelerating pre-filling by up to 1.73x and 1.63x for MHA and GQA models, respectively, with minimal accuracy loss compared to full attention. Notably, combined with quantization, DuoAttention enables Llama-3-8B decoding with 3.3 million context length on a single A100 GPU. Code is provided in this https URL.

一句话总结概括

Motivation

创新点或贡献

具体设计

实验评估

背景

先前工作存在的问题概述

难点

补充背景

思考角度

我如何做这个问题

这个洞见可以引申出其他其他方法吗

该洞见是否可以迁移到其他领域中

该工作有什么可能可以改进的地方

Q&A

results matching ""

    No results matching ""